Reading in data files
Humboldt-Universität zu Berlin
Wed, Nov 29, 2023
required reading: Ch. 8 (data import) in Wickham et al. (2023)
supplementary reading: Ch. 4 (data import) in Nordmann & DeBruine (2022)
So far, we’ve learned how to…
dplyr verbs
Today we will learn how to:
.csv)
readr package
pacman package instead of install.packages() and library()
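p_load() takes package names as arguments
it loads each package that is already installed (like library())
it installs and then loads any package that is missing (like install.packages() + library())
tidyverse loaded, and the new packages janitor and here installed and loaded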
?janitor and ?here in the Console.
there are many different file types that data can take, e.g., .xlsx, .txt, .csv, .tsv.
csv is the most typical data file type, and stands for: Comma Separated Values.
this is what a simple CSV file looks like when viewed as raw text
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
Figure 1: Source: Wickham et al. (2023) (all rights reserved)
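As a minimal sketch of how this raw text maps onto a data frame, the same lines can be handed to read_csv() as literal text via I() (available in readr 2.0 and later). In practice the data would sit in a file; the object name students is only illustrative.

```r
library(readr)

# the raw CSV text from above, passed as literal data instead of a file path
students <- read_csv(I(
"Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
"), show_col_types = FALSE)

# the first line became the column names, every further line one row
students

# AGE is read as text because one entry is the word "five";
# the empty AGE cell becomes NA, while "N/A" stays ordinary text by default
students$AGE
students$favourite.food
```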
RProjects
Our spreadsheet
groesse_geburtstag.csv to your computer, directly in a folder called daten in our project directory
Exercise 1: Saving a CSV
Example 1
daten in your project folder (if you haven’t already).
daten folder as groesse_geburtstag.csv
daten folder and check that the CSV file is there.
readr package
we now have to read in the data
we have to use a function that reads CSV data, and to specify where the data is in our RProject folder
the readr package (part of tidyverse) can load in most data types, and has multiple functions for different data types
| Haustier | Größe | Monat der Geburt | Tag | L1 |
|---|---|---|---|---|
| Lola | 171 | 5 | 7 | Englisch |
| NA | 168 | 11 | 26 | Deutsch |
| N/A | 182 | 4 | 15 | Deutsch |
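A table like the one above could come from a plain read_csv() call. The sketch below is self-contained, so it first writes a small stand-in file to a temporary location; its contents are an assumption, reconstructed from the outputs shown in these slides, and csv_path is a hypothetical name. In the course project the path would point at groesse_geburtstag.csv in the daten folder instead.

```r
library(readr)

# hypothetical stand-in for groesse_geburtstag.csv (contents reconstructed, not the real file)
csv_path <- tempfile(fileext = ".csv")
write_lines(c(
  "Haustier,Größe,Monat der Geburt,Tag,L1",
  "Lola,171,5,7,Englisch",
  ",168,11,26,Deutsch",
  "N/A,182,4,15,Deutsch"
), csv_path)

# read_csv() prints the guessed column types to the Console and returns a tibble
df_groesse <- read_csv(csv_path)

# by default the empty cell becomes NA, but the text "N/A" stays text
df_groesse
```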
Exercise 2: readr
Example 2
groesse_geburtstag.csv dataset and save it as an object called df_groesse
df_ is short for DataFrame; it’s a good idea to use a prefix before object names so we know what each object contains
read_csv, some information is printed in the Console. What is printed?
summary() or head()
here package
daten folder?
here()
here() is starting from, run here()
[1] "/Users/danielapalleschi/Documents/IdSL/Teaching/WiSe2324/B.A./r4ling"
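A short sketch of how here() builds file paths, assuming the folder and file names used in this course. The path it prints depends on where the project lives on your machine, so it will differ from the output above.

```r
library(here)

# with no arguments, here() returns the project root it detected
here()

# further arguments are appended to that root, one folder or file name each
csv_path <- here("daten", "groesse_geburtstag.csv")
csv_path
```

The resulting path can be passed to read_csv() in place of a path typed out by hand, so the same script runs on any computer that has the project folder.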
here package
Image source: Allison Horst (all rights reserved)
NA or N/A values
N/A was written as text in one of our observations, and so R reads it as such
NAs in R refer to missing data (“Not Available”)
N/A written in our df_groesse data is not actually read as a missing value
na = for the read_csv() function, which tells read_csv() which values it should equate with missing values
# A tibble: 3 × 5
Haustier Größe `Monat der Geburt` Tag L1
<chr> <dbl> <dbl> <dbl> <chr>
1 "Lola" 171 5 7 Englisch
2 "" 168 11 26 Deutsch
3 <NA> 182 4 15 Deutsch
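Output of this kind could be produced by passing a single value to na =. The sketch below uses a temporary stand-in file whose contents are an assumption reconstructed from the outputs in these slides; csv_path is a hypothetical name.

```r
library(readr)

# hypothetical stand-in for groesse_geburtstag.csv (contents reconstructed, not the real file)
csv_path <- tempfile(fileext = ".csv")
write_lines(c(
  "Haustier,Größe,Monat der Geburt,Tag,L1",
  "Lola,171,5,7,Englisch",
  ",168,11,26,Deutsch",
  "N/A,182,4,15,Deutsch"
), csv_path)

# only the text "N/A" is now treated as missing
df_groesse <- read_csv(csv_path, na = "N/A", show_col_types = FALSE)

# "N/A" has become NA, but the empty cell is now kept as an empty string
df_groesse$Haustier
```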
"" is read as an NA
read_csv() reading empty cells as NA
read_csv() to read more than one type of input as NA, i.e., we want to tell it to read "" and "N/A" as NA
c().
`Monat der Geburt`)
clean_names() from the janitor package, which we’ve already loaded in
# A tibble: 3 × 5
haustier grosse monat_der_geburt tag l1
<chr> <dbl> <dbl> <dbl> <chr>
1 Lola 171 5 7 Englisch
2 <NA> 168 11 26 Deutsch
3 <NA> 182 4 15 Deutsch
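A sketch of the two steps that lead to output like this: several missing-value codes are combined with c() in na =, and clean_names() then tidies the column names. The temporary file is a hypothetical stand-in with contents reconstructed from these slides, not the real data file.

```r
library(readr)
library(janitor)

# hypothetical stand-in for groesse_geburtstag.csv (contents reconstructed, not the real file)
csv_path <- tempfile(fileext = ".csv")
write_lines(c(
  "Haustier,Größe,Monat der Geburt,Tag,L1",
  "Lola,171,5,7,Englisch",
  ",168,11,26,Deutsch",
  "N/A,182,4,15,Deutsch"
), csv_path)

# treat both empty cells and the text "N/A" as missing values
df_groesse <- read_csv(csv_path, na = c("", "N/A"), show_col_types = FALSE)

# convert the column names to lower-case names without spaces or special characters
df_groesse <- clean_names(df_groesse)

names(df_groesse)
df_groesse
```

Assigning the result back to df_groesse is what makes the cleaned names stick; calling clean_names(df_groesse) on its own only prints the cleaned version.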
head(df_groesse), do you see the cleaned column names?
<-
There are currently 2 pipes that can be used in R.
magrittr package pipe: %>%
the native R pipe: |>
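To illustrate, here is the same calculation written as nested calls and with each pipe. The heights vector is made-up example data, and the native pipe requires R 4.1 or later.

```r
library(magrittr)

# made-up example data
heights <- c(171, 168, 182)

# nested calls are read from the inside out
nested <- round(mean(heights), 1)

# with a pipe, the steps are read from left to right:
# take heights, and then compute the mean, and then round to one decimal place
piped_native <- heights |> mean() |> round(1)
piped_magrittr <- heights %>% mean() %>% round(1)

# all three give the same result
identical(nested, piped_native)
identical(nested, piped_magrittr)
```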
Cmd/Ctrl + Shift + M to produce a pipe
Exercise 3: pipes
Example 3
groesse_geburtstag.csv dataset again with fixed NAs and then
clean_names() on the dataset, and then
head() function
groesse_geburtstag.csv dataset again with fixed NAs, saving it as the object df_groesse, and then
clean_names() on the data set
head() function when you’re saving the dataset as an object?
readr has other functions which are also easy to use, you just have to know when to use which ones
read_csv2() reads semicolon-separated csv files (;)
, as the decimal marker (like Germany)
read_tsv() reads tab-delimited files
read_delim() function reads in files with any delimiter
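The three functions can be compared side by side on the same small data set. This is only a sketch: the names and heights are invented, and the data are passed as literal text with I() (readr 2.0 and later) rather than read from files.

```r
library(readr)

# semicolon-separated values, with "," as the decimal marker
df_csv2 <- read_csv2(I("name;height\nAna;1,75\nBen;1,82\n"), show_col_types = FALSE)

# tab-separated values ("\t" is a tab)
df_tsv <- read_tsv(I("name\theight\nAna\t1.75\nBen\t1.82\n"), show_col_types = FALSE)

# any other delimiter, here "|", is named explicitly with delim =
df_delim <- read_delim(I("name|height\nAna|1.75\nBen|1.82\n"), delim = "|", show_col_types = FALSE)

# all three contain the same names and numeric heights
df_csv2
df_tsv
df_delim
```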
delim = (e.g., read_delim("groesse_geburtstag.csv", delim = ","))
numerical and factor (categorical)
numerical data
month contains numbers, but it could also contain the name of each month
numerical variable, but not of a factor
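The difference between the two variable types can be sketched with a small made-up data frame. The conversion uses as_factor() from forcats (part of the tidyverse), once with $ indexing and once with mutate(); the object names df_base and df_tidy are hypothetical, and the values are modelled on the outputs shown earlier.

```r
library(tibble)
library(dplyr)
library(forcats)

# made-up data modelled on the earlier examples
df <- tibble(
  monat = c(5, 11, 4),
  l1    = c("Englisch", "Deutsch", "Deutsch")
)

# a numerical variable has a mean
mean(df$monat)

# option 1: index the column with $ and overwrite it
df_base <- df
df_base$l1 <- as_factor(df_base$l1)

# option 2: tidyverse syntax with mutate()
df_tidy <- df |>
  mutate(l1 = as_factor(l1))

# a factor has a fixed set of categories (levels) rather than a mean
levels(df_tidy$l1)
```

Both options give the same result; as_factor() orders the levels by their first appearance in the data.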
as_factor()
as_factor() function to change a variable type to factor
$ to index a column in a dataframe:
tidyverse syntax and the mutate() function
Today we learned how to…
readr package ✅
Let’s now put this new knowledge to use.
Let’s now practice using the readr package and wrangling our data.
readr functions
|"?
read_csv() and read_tsv() have in common?
;) as delimiter?
Re-load the groesse_geburtstag.csv file. Use pipes to also use the clean_names function, and to make the following changes in the object df_groesse:
l1 to a factor.
grosse to groesse
monat_der_geburt to geburtsmonat
df_groesse dataset, visualising the relationship between our birth day and our birth days (this doesn’t make sense to compare, but it’s just an exercise). Set the colour and shape to correspond to L1. Add a plot title.
l1
Produced with R version 4.3.0 (2023-04-21) (Already Tomorrow) and RStudio version 2023.9.0.463 (Desert Sunflower).
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magick_2.7.4 patchwork_1.1.3 here_1.0.1 janitor_2.2.0
[5] languageR_1.5.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[9] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[13] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.3 generics_0.1.3 stringi_1.7.12 hms_1.1.3
[5] digest_0.6.33 magrittr_2.0.3 evaluate_0.21 grid_4.3.0
[9] timechange_0.2.0 fastmap_1.1.1 rprojroot_2.0.3 jsonlite_1.8.7
[13] fansi_1.0.4 scales_1.2.1 cli_3.6.1 crayon_1.5.2
[17] rlang_1.1.1 bit64_4.0.5 munsell_0.5.0 withr_2.5.0
[21] yaml_2.3.7 parallel_4.3.0 tools_4.3.0 tzdb_0.4.0
[25] colorspace_2.1-0 pacman_0.5.1 vctrs_0.6.3 R6_2.5.1
[29] lifecycle_1.0.3 snakecase_0.11.0 bit_4.0.5 vroom_1.6.3
[33] pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.4 Rcpp_1.0.11
[37] glue_1.6.2 xfun_0.39 tidyselect_1.2.0 rstudioapi_0.14
[41] knitr_1.44 htmltools_0.5.5 rmarkdown_2.22 compiler_4.3.0
Week 7 - Data import